perf: [iceberg] Use protobuf instead of JSON to serialize Iceberg partition values by parthchandra · Pull Request #3247 · apache/datafusion-comet

parthchandra · 2026-01-22T22:16:54Z

Rationale for this change

We see increased GC times in jobs with Iceberg scans over a large number (10K-100K) of partitions.

What changes are included in this PR?

Partition values are currently serialized to native by constructing a JSON string. This PR changes that to use Protobuf instead.
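
To illustrate why an integer-based wire format can be more compact than a string-based one, here is a minimal sketch of protobuf's varint field encoding next to a quoted decimal string. It is not the PR's code: `encode_varint`, `encode_int64_field`, and `quoted_decimal` are hypothetical helpers, field number 5 is taken from the `timestamp_val` field in the schema under review, and the quoted-decimal form is only an assumed stand-in for a string rendering (the previous JSON layout is not shown in this thread).

```rust
/// Encode an unsigned value as a protobuf base-128 varint
/// (7 bits per byte, least-significant group first, high bit = "more bytes follow").
fn encode_varint(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// Encode an int64 field: key = (field_number << 3) | wire type 0, then the value.
/// int64 is written as its two's-complement bit pattern (no zigzag), so negative
/// values always take 10 bytes.
fn encode_int64_field(field_number: u32, value: i64) -> Vec<u8> {
    let mut out = encode_varint(u64::from(field_number) << 3);
    out.extend(encode_varint(value as u64));
    out
}

/// Assumed string rendering of the same value, used only for a size comparison.
fn quoted_decimal(value: i64) -> String {
    format!("\"{}\"", value)
}
```

For a timestamp of 1_700_000_000_000_000 microseconds since epoch, `encode_int64_field(5, ...)` yields 9 bytes (one key byte plus an 8-byte varint), while the quoted decimal string is 18 bytes. The native side can also read the integer directly instead of parsing text.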

How are these changes tested?

Added a new unit test for a large number of partitions.

AI note: large parts were generated using Claude Code.

hsiang-c

Replacing JSON with Protocol Buffers LGTM.

codecov-commenter · 2026-01-22T22:50:38Z

Codecov Report

❌ Patch coverage is 39.00000% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.07%. Comparing base (f09f8af) to head (e8b87e7).
⚠️ Report is 873 commits behind head on main.

Files with missing lines	Patch %	Lines
.../comet/serde/operator/CometIcebergNativeScan.scala	39.00%	55 Missing and 6 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3247      +/-   ##
============================================
+ Coverage     56.12%   60.07%   +3.94%     
- Complexity      976     1438     +462     
============================================
  Files           119      172      +53     
  Lines         11743    15927    +4184     
  Branches       2251     2631     +380     
============================================
+ Hits           6591     9568    +2977     
- Misses         4012     5031    +1019     
- Partials       1140     1328     +188

andygrove · 2026-01-22T22:50:42Z

+/// This replaces JSON parsing with direct protobuf deserialization with a more compact
+/// representation (e.g., timestamps as integers vs strings).


nit: we can remove the part of the comment that explains how this used to work

hsiang-c · 2026-01-22T22:51:30Z

+      checkIcebergNativeScan(
+        "SELECT COUNT(*) FROM s3_catalog.db.large_partitioned_test WHERE partition_id IN (0, 50, 99)")
+
+      spark.sql("DROP TABLE s3_catalog.db.large_partitioned_test")


(nit) You can try DROP TABLE s3_catalog.db.large_partitioned_test PURGE to remove files on disk.

andygrove · 2026-01-22T22:53:22Z

+  oneof value {
+    int32 int_val = 2;
+    int64 long_val = 3;
+    int64 date_val = 4;           // days since epoch
+    int64 timestamp_val = 5;       // microseconds since epoch
+    int64 timestamp_tz_val = 6;    // microseconds with timezone
+    string string_val = 7;
+    double double_val = 8;
+    float float_val = 9;
+    bytes decimal_val = 10;        // unscaled BigInteger bytes
+    bool bool_val = 11;
+    bytes uuid_val = 12;
+    bytes fixed_val = 13;
+    bytes binary_val = 14;


We may want to consider consolidating this with the existing Literal defined in protobuf. This does not need to happen for the current PR.

message Literal {
  oneof value {
    bool bool_val = 1;
    // Protobuf doesn't provide int8 and int16, we put them into int32 and convert
    // to int8 and int16 when deserializing.
    int32 byte_val = 2;
    int32 short_val = 3;
    int32 int_val = 4;
    int64 long_val = 5;
    float float_val = 6;
    double double_val = 7;
    string string_val = 8;
    bytes bytes_val = 9;
    bytes decimal_val = 10;
    ListLiteral list_val = 11;
  }

These are Iceberg types though.
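
The `decimal_val` field in the schema above is documented as unscaled BigInteger bytes. As a hedged sketch of what decoding such a value on the native side could look like, the helper below assumes the bytes are big-endian two's complement (the layout Java's `BigInteger.toByteArray()` produces) and that the value fits in 128 bits. `unscaled_from_be_bytes` is a hypothetical name, not a function from this PR.

```rust
/// Decode big-endian two's-complement bytes into an i128 unscaled decimal value.
/// Returns None for empty input or for values wider than 128 bits.
fn unscaled_from_be_bytes(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend: pad with 0xFF when the top bit is set (negative), else 0x00.
    let fill: u8 = if (bytes[0] & 0x80) != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}
```

For example, the bytes `[0x01, 0x00]` decode to 256 and `[0xFF]` decodes to -1. The scale would still have to come from the partition field's declared decimal type, since the bytes carry only the unscaled value.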

andygrove

LGTM pending CI

andygrove · 2026-01-22T23:20:50Z

+  /**
+   * Legacy JSON serialization function - removed in favor of protobuf. Kept as reference for
+   * conversion logic.
   */
  private def partitionValueToJson(fieldTypeStr: String, value: Any): JValue = {


can we just remove this now?

Found some unused code as a result of removing this. Thanks!

andygrove · 2026-01-22T23:21:44Z

@parthchandra there is a clippy failure

andygrove · 2026-01-23T19:24:51Z

@parthchandra do you have benchmarks showing the performance improvement?

parthchandra · 2026-01-23T19:52:19Z

Merged. Thanks @andygrove @hsiang-c

parthchandra · 2026-01-23T21:47:19Z

@parthchandra do you have benchmarks showing the performance improvement?

(Sorry, didn't notice this before merging).
This likely won't have a performance benefit as much as it might help with the GC pressure we see when serializing the plan involving a large number of partitions.
@hsiang-c is helping with the profiling runs; we will post some of that data here.

parthchandra · 2026-01-23T22:24:01Z

@parthchandra do you have benchmarks showing the performance improvement?

(Sorry, didn't notice this before merging). This likely won't have a performance benefit as much as it might help with the GC pressure we see when serializing the plan involving a large number of partitions.

The change didn't seem to have much of an impact: FilePartition info went from 2.96% of total allocation to 2.94%.

parthchandra requested review from andygrove and mbutrovich January 22, 2026 22:27

hsiang-c approved these changes Jan 22, 2026

andygrove reviewed Jan 22, 2026

hsiang-c reviewed Jan 22, 2026

andygrove reviewed Jan 22, 2026

andygrove approved these changes Jan 22, 2026

andygrove reviewed Jan 22, 2026

parthchandra added 4 commits January 23, 2026 08:40

perf: Use protobuf instead of JSON to serialize Iceberg partition values

07be0eb

format

39aea1e

address review comments, remove unused Json code and fix unit test to avoid iceberg optimization

d332faa

scalastyle

e8b87e7

parthchandra force-pushed the iceberg_scan_protobuf_partition branch from 8319470 to e8b87e7 Compare January 23, 2026 16:50

parthchandra merged commit 077005c into apache:main Jan 23, 2026
222 of 223 checks passed

4 participants